Abstract

This paper describes a checkpoint comparison and optimistic execution technique for error
detection  and  recovery  in  distributed  and  parallel  systems.  The  approach  is  based  on  lookahead 
execution  and  rollback  validation.  It  uses  replicated  tasks  executing  on  different  processors  for 
forward  recovery  and  checkpoint  comparison  for  error  detection.  Two  schemes  derived  from  this 
strategy  are  analyzed  and  compared  with  triplication  and  voting,  and  with  two  common  backward 
recovery  methods.  The  impact  of  checkpoint  time,  checkpoint  validation  time,  and  process  restart 
time  is  also  examined.  An  implementation  on  a  Sun  NFS  network  with  six  benchmark  programs 
is  presented.  Compared  with  classic  checkpointing  and  rollback  techniques,  our  strategy  provides 
rapid  recovery  and  requires,  on  average,  fewer  processors  than  standard  replication  and  voting 
methods.  This  strategy  is  useful  in  systems  where  spare  processors  are  available  at  the  time  of 
recovery.

I.  Introduction

Considerable research has been devoted to checkpoint-based backward recovery schemes [1-5].
There have also been techniques proposed which combine replication with voting and checkpoint
rollback recovery [6-8]. The RAFT algorithms replicate the computation on two processors to
achieve error detection and rollback recovery [6,7]. If the results produced by the replicated tasks
do not match, the task is executed on other processors until a pair of matched results is found.
Checkpoint-based backward recovery has two drawbacks: an execution time penalty due to checkpointing
and rollback, and the problem of determining if a checkpoint is error free. Although placing
checkpoints  optimally  can  reduce  the  execution  time  penalty  to  some  extent,  the  computation  lost 
by  rollback  is  inherent  [2-5].  One  approach  to  validating  a  checkpoint  is  to  validate  the  system  state 
via  concurrent  error  detection  or  system  diagnosis,  before  a  checkpoint  is  taken  [9, 10].  Another  is 
to  simply  keep  a  series  of  consecutive  checkpoints  and  perform  multiple  rollbacks  when  necessary 
[11].  In  contrast  to  backward  recovery,  forward  recovery  attempts  to  reduce  the  lost  computation 
by  manipulating  some  portion  of  the  current  state  to  produce  an  error-free  new  state.  However, 
forward  recovery  generally  depends  on  accurate  damage  assessment,  a  correction  mechanism,  and 
sometimes  massive  redundancy  (e.g.,  NMR)  [1,12]. 

In this paper, we present a checkpoint-based error detection and optimistic recovery strategy
for parallel and distributed systems. This strategy requires neither application-specific error correction
nor massive static NMR for error masking to achieve forward recovery. It uses checkpoint
comparison for checkpoint validation and optimistic execution for forward recovery.

The  following  section  describes  our  recovery  strategy;  the  subsequent  section  discusses  the 
performance evaluation of the recovery schemes derived from our strategy. Section IV presents an
experimental evaluation with a distributed implementation on a Sun NFS-based network.

II.  Recovery  with  Optimistic  Execution  Using  Checkpoints


A.  Computation  and  System  Model 


The system considered consists of homogeneous processing elements connected to each other
and to secondary storage by a network. The processing element can be either a computer node in
a distributed system or a CPU node in a multiprocessor system. The network can be a LAN for
a distributed system or a general connection network for a parallel system. We assume that the
necessary checkpoints are retained on reliable secondary storage and are accessible through the
interconnection network.

A  task  is  an  independent  computation  and  it  can  be  a  group  of  related  subtasks.  A  task  is 
divided  into  a  series  of  sequential  subcomputation  sessions  by  checkpoints.  A  process  is  the  task 
running on a processing element. A process can be replicated on different processors.

A  checkpoint  consists  of  two  types  of  information:  the  current  process  state  for  process  restart 
and  the  test  information  for  process  state  validation.  The  state  and  test  information  may  or  may 
not  be  separate  entities  within  the  checkpoint.  If  the  checkpoint  is  the  complete  run-time  process 
image,  the  test  information  can  be  the  image  itself  or  the  signature  of  the  image.  Checkpoint 
comparison is used to detect erroneous system state or validate the checkpoint. This approach assumes that
the probability of two checkpoints being identical as a result of one or two erroneous processes
is negligible.
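Checkpoint comparison by signature can be sketched as follows. This is a minimal illustration, not the authors' implementation: the use of SHA-256 as the signature, the function names `signature` and `checkpoints_agree`, and the sample byte strings are all assumptions made for the example.

```python
import hashlib


def signature(checkpoint: bytes) -> str:
    """Return a digest that serves as the test information of a checkpoint.

    SHA-256 is an assumption of this sketch; the text only says that the
    test information may be the image itself or a signature of the image.
    """
    return hashlib.sha256(checkpoint).hexdigest()


def checkpoints_agree(a: bytes, b: bytes) -> bool:
    """Treat two checkpoints as valid when their signatures are identical."""
    return signature(a) == signature(b)


# Hypothetical process images produced by a replicated process pair.
image_1 = b"stack+data segment after session 1"
image_2 = b"stack+data segment after session 1"
corrupted = b"stack+data segment after sessi0n 1"

print(checkpoints_agree(image_1, image_2))    # matching pair
print(checkpoints_agree(image_1, corrupted))  # disagreeing pair
```

Comparing signatures rather than full images reduces the amount of data the voter must read, at the cost of relying on the signature to distinguish erroneous states.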

This paper deals only with faults that cause an error in a process and result in an erroneous
checkpoint. Faults in the processor interconnection network or the secondary storage may be neither
detectable nor recoverable in our approach. We also assume a reli
checkpoint  comparisons.

Figure  1.  Lookahead  Execution  and  Rollback  Validation.

B.  Lookahead  Execution  and  Rollback  Validation

Two essential features of our recovery strategy are lookahead execution to reduce the computation
loss due to recovery and rollback validation to diagnose the correctly scheduled lookahead.
These  concepts  are  illustrated  in  Figure  1.  As  in  the  RAFT  scheme,  a  task  is  replicated  and 
executed  concurrently  on  two  different  processors  [6,7].  At  the  end  of  one  computation  session, 
two  checkpoints  are  produced  by  the  replicated  process  pair.  A  voter  process  compares  the  newly 
generated  but  uncommitted  checkpoints  to  determine  if  the  process  state  is  error  free.  If  the  two 
checkpoints  are  identical,  the  system  state  is  valid.  Either  of  the  checkpoints  can  be  committed  for 
the  past  computation  session  and  the  process  pair  advances  to  the  next  session. 

If the uncommitted checkpoints disagree, then at least one of them contains an erroneous state.
Instead of rolling back, two identical task processes are started from the uncommitted checkpoints
on two additional processors. This optimistic scheduling is called lookahead (optimistic) execution.
Meanwhile, another process rolls back to the last committed checkpoint on a fifth processor. After
a checkpoint interval, $\Delta$, a diagnosis checkpoint is produced by the rollback (validation) process.
This checkpoint is compared to the two disagreeing (uncommitted) checkpoints. If there is a match,
the error-free checkpoint is identified and committed. The process pair that was executed ahead
from the erroneous checkpoint and the rollback validation process are terminated. In
this strategy, the two additionally scheduled lookaheads make it possible not to roll back the whole
system when there is an error during lookahead executions. In this case, the lookahead pair from
the newly verified checkpoint is treated as the normal pair. This pair can start a new round of
lookahead and rollback validation without rolling back the whole system.

Compared to the static redundancy of three processors for TMR, this strategy uses two processors
for the common error-free situation and a dynamic redundancy of five for the rare occurrence
of an error. The potential for forward recovery lies in the fact that there should be at least one
correct process (thus, one valid checkpoint) during the normal run, since the lookahead execution
from this valid checkpoint advances the computation without rollback. However, rollback may not
be avoidable when the diagnosis checkpoint does not agree with either of the two uncommitted
disagreeing checkpoints, since all optimistic executions may be incorrect.
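The decision taken by the voter at the end of a session can be sketched as a small function. This is an illustrative sketch under stated assumptions: checkpoints are modeled as byte strings compared for equality, and the names `Outcome` and `diagnose` are hypothetical rather than taken from the implementation.

```python
from enum import Enum
from typing import Optional, Tuple


class Outcome(Enum):
    COMMIT = "commit"            # checkpoints agree, no recovery needed
    FORWARD = "forward"          # diagnosis identifies the valid checkpoint
    ROLLBACK = "rollback"        # diagnosis fails, all lookaheads discarded


def diagnose(cp_a: bytes, cp_b: bytes,
             cp_validation: Optional[bytes] = None
             ) -> Tuple[Outcome, Optional[bytes]]:
    """Return the recovery outcome and the checkpoint to commit, if any.

    cp_a and cp_b are the uncommitted checkpoints of the replicated pair.
    cp_validation is the diagnosis checkpoint produced by the rollback
    validation process; it is only consulted when cp_a and cp_b disagree.
    """
    if cp_a == cp_b:
        return Outcome.COMMIT, cp_a
    if cp_validation == cp_a:
        # Lookahead pair started from cp_a becomes the normal pair.
        return Outcome.FORWARD, cp_a
    if cp_validation == cp_b:
        # Lookahead pair started from cp_b becomes the normal pair.
        return Outcome.FORWARD, cp_b
    # The diagnosis checkpoint matches neither: rollback is unavoidable.
    return Outcome.ROLLBACK, None


print(diagnose(b"good", b"good"))
print(diagnose(b"good", b"bad", b"good"))
print(diagnose(b"bad1", b"bad2", b"good"))
```

In the forward case the lookahead processes started from the losing checkpoint and the validation process would be terminated; the sketch only returns the decision and leaves process management out.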

C.  Recovery  Design  Considerations 

With respect to lookahead and rollback scheduling, there are four critical parameters in designing
a specific recovery scheme based on the approach described. The first is the number of
replicated processes in the normal run, which we call base (redundancy) size. The larger the base
size, the more potential there is for forward recovery, since it is likely to have an error-free checkpoint
for successful lookaheads. The second is the validation size or the number of processes used
for rollback validation; the third is the validation depth or the number of retries of the rollback
validation process if a rollback validation fails to diagnose the disagreeing checkpoints. We can use
either a larger validation size or a larger validation depth to increase the diagnosis success rate. In
the case of a larger validation depth, the rollback validation success rate is increased by using time
redundancy. The fourth is the lookahead size or the number of lookahead processes scheduled.

A  forward  recovery  scheme  is  recursive  if  its  validation  depth  is  unlimited.  In  this  case,  the 
processes  executed  ahead  can  spawn  their  children  of  lookahead  and  validation  tasks  unboundedly, 
as the validation retries increase. If multiple failures during a checkpoint period are rare, then it is
unlikely that recursive validation and lookahead process spawning will be required. A nonrecursive
scheme is an approximation of its recursive counterpart. In fact, it is the corresponding recursive
scheme with all validation retries greater than the validation depth truncated. If the processor
resource  is  limited,  limiting  the  lookaheads  scheduled  (lookahead  size  <  base  size)  leads  to  graceful 
performance  degradation  [13].  At  one  extreme,  the  recovery  scheme  degenerates  into  a  normal 
rollback  scheme  such  as  RAFT  if  the  lookahead  size  is  zero  [6,7]. 

D.  Limitations 

Our  approach  is  useful  in  systems  where  extra  spare  processors  are  available  at  the  time 
of  recovery.  Other  limitations  to  our  approach  are  the  requirement  that  an  error  results  in  an 
erroneous  checkpoint,  and  the  implementation  requirement  that  the  checkpoints  generated  from 
different  processors  for  the  same  computation  should  be  identical  if  there  is  no  error. 

III.  Performance  Evaluation

In our analytical and experimental evaluation, three types of overhead are considered: checkpoint
time ($t_k$), process restart time ($t_r$) and checkpoint testing time ($t_t$). For purposes of analysis,
constant checkpoint intervals and overheads are used. Each processor has a constant probability
of failure, $p_f$, during one computation session ($\Delta + t_k$) with or without restart and checkpoint
test. This assumption implies two requirements. The first is a Poisson distribution for the failure
distribution, while the second is $t_t \ll \Delta + t_k$ and $t_r \ll \Delta + t_k$, since the probability of failure over
$[0, \Delta + t_k]$ is required to be equal to that over $[0, \Delta + t_k + t_r + t_t]$. The typical test time $t_t$ and
restart time $t_r$ are on the order of a fraction of a second and the checkpoint interval $\Delta$ is on the order
of minutes or hours.
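The second requirement can be checked numerically. The sketch below assumes failures arrive as a Poisson process with a hypothetical rate `rate` (the text does not name this symbol), so that the probability of at least one failure in a window of length $d$ is $1 - e^{-\text{rate} \cdot d}$. All numeric values are illustrative assumptions.

```python
import math


def failure_probability(rate: float, duration: float) -> float:
    """Probability of at least one failure in `duration` for a Poisson process."""
    return 1.0 - math.exp(-rate * duration)


# Illustrative values (assumptions): one failure per day on average,
# a 10-minute checkpoint interval, and sub-second restart and test times.
rate = 1.0 / (24 * 3600)   # failures per second
delta = 600.0              # checkpoint interval, seconds
t_k = 5.0                  # checkpoint time, seconds
t_r = 0.5                  # restart time, seconds
t_t = 0.5                  # checkpoint test time, seconds

p_short = failure_probability(rate, delta + t_k)
p_long = failure_probability(rate, delta + t_k + t_r + t_t)
relative_gap = (p_long - p_short) / p_short

print(f"p_f over session only:          {p_short:.6f}")
print(f"p_f with restart and test time: {p_long:.6f}")
print(f"relative difference:            {relative_gap:.4%}")
```

With restart and test times that are small relative to the session length, the two probabilities differ by a fraction of a percent, which is what justifies treating $p_f$ as constant.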

In order to consider the impact of the centralized file server that handles checkpoint files, we
assume that $t_k$ and $t_r$ increase approximately n-fold when n processes access their checkpoint files
at the same time. This assumption enables us to study the impact of a file server by adjusting $t_k$
and $t_r$, since both restart time $t_r$ and checkpoint time $t_k$ will be increased due to the file accesses
to a single server. The increase in $t_k$ and $t_r$ may not be proportional to the number of processes
that access the same file. However, a checkpoint file usually contains many blocks. A fair server
policy guarantees that the n processes finish their access to the checkpoint file at approximately
the same time. We also assume that checkpoint comparison is performed by a voter process on a
host that can access the file system locally, and thus $t_t$ is not changed.

A.  Performance  Metrics 

The  performance  measures  we  examine  in  this  paper  are: 

•  Relative Execution Time, $R_e$: the ratio of the expected execution time ($T_e$) over the error-free
execution time ($T_0$). This measure normalizes the effect of the execution time for different
computations. If $R_e$ is close to one, the execution time will be close to the error-free execution
time.

•  Number of Processors, $N_p$: the average number of processors used by the system, or $\frac{1}{T_e}\int_0^{T_e} N_p(t)\,dt$.
It describes the processor redundancy required by a recovery scheme. The maximal instantaneous
$N_p(t)$ reflects the maximal processor requirement.

•  Number of Checkpoints, $N_c$: the average number of checkpoints stored in the system, or
$\frac{1}{T_e}\int_0^{T_e} N_c(t)\,dt$. The maximal instantaneous $N_c(t)$ reflects the maximal storage requirement.
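The time-averaged measures can be illustrated with a piecewise-constant timeline. This is a minimal sketch: the function name `time_average` and the sample timeline are assumptions chosen for illustration, not measured data.

```python
from typing import List, Tuple


def time_average(segments: List[Tuple[float, float]]) -> float:
    """Time-weighted average of a piecewise-constant quantity.

    Each segment is (duration, value); the result is the integral of the
    value over time divided by the total duration.
    """
    total_time = sum(duration for duration, _ in segments)
    weighted = sum(duration * value for duration, value in segments)
    return weighted / total_time


# Hypothetical run: nine session lengths with two processors in use,
# followed by one session length of recovery with five processors in use.
timeline = [(9.0, 2), (1.0, 5)]

average_processors = time_average(timeline)
peak_processors = max(value for _, value in timeline)

print(average_processors)  # time-averaged processor count
print(peak_processors)     # maximal instantaneous processor count
```

The example shows why a scheme with a high instantaneous processor requirement can still have a low average requirement when recovery is infrequent.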

B.  Alternative  Recovery  Schemes 

We examine two alternative schemes derived from our proposed recovery strategy and three
other common schemes. These five recovery schemes are characterized in Table 1. Both DMR-F-1
and DMR-F-2 are nonrecursive schemes derived from our recovery strategy. Their rollback
validation is limited to one try with one or two rollback validation processes. The TMR-F scheme
is the common TMR forward recovery scheme using error masking and majority voting. The DMR-B-1
and DMR-B-2 schemes are recursive rollback schemes modified from the RAFT algorithms [6,7].
They use two processes for the normal execution. If the checkpoints match after checkpointing, the
execution advances to the next session. If there is no match, one or two validation processes roll
back repeatedly until a matched checkpoint pair is obtained. These two recursive rollback schemes
have the best performance among all their nonrecursive approximations.
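The four design parameters of the five schemes can be captured in a small record type. This is a sketch for illustration: the class name `Scheme` and the helper `peak_with_lookahead` are hypothetical, and the peak formula (base size plus lookahead size plus validation size) is an assumption that applies only to the lookahead schemes, where the base pair, the additional lookahead processes, and the validation processes all run at once during recovery.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Scheme:
    name: str
    base_size: int
    validation_size: int
    validation_depth: float   # math.inf for recursive schemes
    lookahead_size: int

    @property
    def recursive(self) -> bool:
        """A scheme is recursive if its validation depth is unlimited."""
        return math.isinf(self.validation_depth)


SCHEMES = [
    Scheme("DMR-F-1", 2, 1, 1, 2),
    Scheme("DMR-F-2", 2, 2, 1, 2),
    Scheme("TMR-F", 3, 0, 0, 0),
    Scheme("DMR-B-1", 2, 1, math.inf, 0),
    Scheme("DMR-B-2", 2, 2, math.inf, 0),
]


def peak_with_lookahead(scheme: Scheme) -> int:
    """Peak processor count during recovery for a lookahead scheme (assumed formula)."""
    if scheme.lookahead_size == 0:
        raise ValueError("formula only applies to schemes that schedule lookaheads")
    return scheme.base_size + scheme.lookahead_size + scheme.validation_size


for s in SCHEMES:
    kind = "recursive" if s.recursive else "nonrecursive"
    print(f"{s.name}: {kind}, lookahead size {s.lookahead_size}")
```

For DMR-F-1 the assumed formula gives five processors, which agrees with the dynamic redundancy of five described for the strategy.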

C.  Discussions 

The derived analytical model for DMR-F-1 and DMR-F-2 is presented in Appendix A. The
results are summarized in Tables 2 and 3 (the symbol fs denotes the corresponding cases with
a centralized file server). The analysis for TMR-F, DMR-B-1 and DMR-B-2 is very similar [13].
Generally, $R_e$, $N_p$ or $N_c$ can be expressed in terms of the relative overhead factors as
Table 1.  Five Schemes Compared.

Scheme    Type                             Base size  Validation size  Validation depth  Lookahead size
DMR-F-1   nonrecursive forward recovery    2          1                1                 2
DMR-F-2   nonrecursive forward recovery    2          2                1                 2
TMR-F     triple module redundancy         3          0                0                 0
          forward recovery
DMR-B-1   recursive backward recovery      2          1                ∞                 0
DMR-B-2   recursive backward recovery      2          2                ∞                 0

where $c$ is a constant, $m \in \{R_e, N_c, N_p\}$ and <],. is either $2t_k$ or $3t_k$. The constant $c$ is the error-free
part of the performance measure, while $\alpha$ is the performance degradation due to rollbacks in the
schemes we considered. The smaller $\alpha$ is for a recovery scheme, the more effective it is in terms
of reducing the execution time degradation. In this paper, $\alpha$ is called the coefficient of overhead
due to rollback. This rollback overhead cannot be eliminated and depends only on the failure
probability, $p_f$. The expression $c + \alpha$ represents the inherent performance of a particular scheme.
The factors $\beta$, $\gamma$, and $\delta$ are the overhead coefficients for process restart, checkpoint comparison
and checkpointing.

For $R_e$, the overhead coefficients $\alpha$, $\beta$, $\gamma$ and $\delta$ are related only to $p_f$. For $N_p$ and $N_c$, the
coefficients also include a factor of $1/R_e$. Normally, the relative overheads are very small, and we can
approximate $R_e$ with the zero-overhead $R_e$. This approximation, in fact, gives the upper bound
for $\alpha$, $\beta$, $\gamma$ and $\delta$ in $N_p$ and $N_c$, since the presence of $t_r$, $t_t$ and $t_k$ increases $R_e$. Therefore, $N_p$ and
$N_c$ are approximately a linear function of overhead factors. The overhead coefficients represent the
Table  2.  Analytical  Evaluation  Summary:  DMR-F-1.


$p_l = 2p_f(1 - p_f)^2$

$p_r = 2p_f^2(1 - p_f) + p_f^2$

T, 

n(A  +  tk)  (l  + 

$R_e = 1 + \frac{2p_r}{1-p_r} + \frac{p_l + 2p_r}{1-p_r}\,\frac{t_r}{\Delta + t_k} + \frac{2.5p_l + 3p_r}{1-p_r}\,\frac{t_t}{\Delta + t_k}$

Nc 

1  1  O  Pl  +  Pr  ,  o  Pl  +  Pr  tr  i  o  6.25p|-(-8pr  ti 
o  1  o  Pl+Pr  1  O  Pl+Pr  tr  1  ■>  1.5pi-)-2pr  ft

^  '^(l-pr)R,  '  -^(l-prlfl.  A-(-ifc  '  '^(1-pr)/?,  A-ff* 

max(Np) 

5 

$R_e(fs) = 1 + \frac{2p_r}{1-p_r} + \frac{p_l + \frac{5}{3}p_r}{1-p_r}\,\frac{3t_r}{\Delta + 2t_k} + \frac{2.5p_l + 3p_r}{1-p_r}\,\frac{t_t}{\Delta + 2t_k} + \frac{p_l + p_r}{1-p_r}\,\frac{3t_k}{\Delta + 2t_k}$
1  1  -1  Pt+I>r  1  Pl+Pr  3tr  i
Np(fs)

1  >1.5p(-(-2pr  t,  1  .}  PcfPr  34 

^'^{l-Pr)Rr  t\+2tk  '  -^(I-Prlfl,  A-l-24 

contribution  of  their  corresponding  overhead  factors  to  the  performance  degradation.  The  larger 
the  coefficient  for  an  overhead  factor,  the  more  important  this  overhead  factor  is  with  respect  to 
performance  degradation. 
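The relative execution time for DMR-F-1 without a centralized file server can be evaluated numerically. The sketch below is hedged in two ways: the expressions for $p_l$, $p_r$ and $R_e$ are transcribed from a reading of Table 2 whose exponents and operators had to be inferred, and the function names and the argument names `r_restart` and `r_test` (standing for $t_r/(\Delta + t_k)$ and $t_t/(\Delta + t_k)$) are hypothetical.

```python
from typing import Tuple


def dmr_f1_probabilities(p_f: float) -> Tuple[float, float]:
    """Return (p_l, p_r) for DMR-F-1 as read from Table 2.

    p_l: exactly one base process fails and the validation process is correct.
    p_r: both base processes fail, or one fails and validation also fails.
    """
    p_l = 2.0 * p_f * (1.0 - p_f) ** 2
    p_r = 2.0 * p_f ** 2 * (1.0 - p_f) + p_f ** 2
    return p_l, p_r


def dmr_f1_relative_time(p_f: float, r_restart: float = 0.0,
                         r_test: float = 0.0) -> float:
    """Relative execution time R_e = c + alpha + beta*r_restart + gamma*r_test."""
    p_l, p_r = dmr_f1_probabilities(p_f)
    alpha = 2.0 * p_r / (1.0 - p_r)                 # rollback coefficient
    beta = (p_l + 2.0 * p_r) / (1.0 - p_r)          # restart coefficient
    gamma = (2.5 * p_l + 3.0 * p_r) / (1.0 - p_r)   # comparison coefficient
    return 1.0 + alpha + beta * r_restart + gamma * r_test


for p_f in (0.01, 0.05, 0.10):
    inherent = dmr_f1_relative_time(p_f)
    loaded = dmr_f1_relative_time(p_f, r_restart=0.05, r_test=0.05)
    print(f"p_f={p_f:.2f}  zero-overhead R_e={inherent:.4f}  5% overheads R_e={loaded:.4f}")
```

A useful consistency check on the transcription is that $p_l + p_r$ equals $1 - (1 - p_f)^2$, the probability that the two base checkpoints disagree.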

For the noncentralized file server situation, $\delta$ is zero. The checkpoint time, $t_k$, does not appear
as an overhead factor because an error-free execution time that includes checkpoint time, $n(\Delta + t_k)$,
is used as the base for our performance measures. The overhead coefficient for the checkpoint time
is $c + \alpha$ if the checkpoint-less error-free execution time is used as the base for the performance
measures. For example, $R_e$ can be redefined as the ratio of the expected execution time over the
checkpointless error-free execution time, instead of $T_e/T_0$. That is,


Table 3.  Analytical Evaluation Summary: DMR-F-2.
^  ^  ^  (l-Pr)fir  A+tfc  ^  ^  (1-pr)/?,  A  +  l*

max(  Np) 

6 

Reifs) 

,  1  p,+2pr  1  pi+5/3p,+5/3pr  3tr  1 

l-pr  '  1-Pr  A  +  2tfc'^ 

1  3.5p,  +  5p,+5pr  t,  1  Pl+Pj+Pr  3tt 

1-Pr  A  +  2t*  '  1-pr  A  +  2tt 

yAfs) 

1  1  O  Pl  +  Pr+Pr  I  .1  Pl+P,+Pr  3tr  , 

'  ^(l-pr)/?r(/.»)  '  ‘■(l-pr)flr(/5)  A  +  2tt 

1  o  1  lpi+2t.5p,+21.5pr  t,  1  Pl+P, +Pr  3tl, 

(l-Pr)ftr(/i)  A  +  2tk  '  ‘•(l-pr)flr(/3)  A  +  2tk 

Np(fs) 

2  1  iPI+P-'+Pr  ,  iPl-t-Pr+Pr  3tr  i 

'^^^(l-pr)flr  '  *(l-pr)flr  A  +  2t*^ 

,  1  2.5pi+5.5p,+5.5pr  tt  1  |Pi+Pj+Pr  3tt 

(I-Pr)flr  A  +  2t*  '  ‘  (l-pr)A,  A  +  2tfc 

$$R_e' = \frac{T_e}{n\Delta} = \frac{\Delta + t_k}{\Delta}\,R_e = c + \alpha + (c + \alpha)\frac{t_k}{\Delta} + \beta\frac{t_r}{\Delta} + \gamma\frac{t_t}{\Delta}.$$


Checkpointing overhead is inherent in any checkpoint-based scheme since checkpoint time is always
included in the execution whether a fault occurs or not. To minimize the impact of this overhead,
the optimal checkpoint interval or frequency should be utilized.

The performance degradation due to a centralized file server is reflected in two ways. The first
is the increased overhead caused by the file access serialization. For example, an approximate factor
of 3 appears for the restart overhead term in $R_e(fs)$. The second is the nonzero overhead
coefficient for checkpoint time ($\delta$) because of the extra checkpointing activities by the lookahead
and rollback validation processes during recovery.


D.  Comparison 

In order to compare the five schemes we described above, $R_e$, $N_p$, and $N_c$ are plotted in
Figures 2, 3 and 4. The solid curves depict the zero-overhead case (i.e., the inherent performance,
$c + \alpha$), whereas the dotted curves depict the case with 5% overheads (e.g., $t_k$, $t_r$ and $t_t$ are each 5% of
$\Delta + t_k$).

In Figure 2, the expected execution times for DMR-F-1 and DMR-F-2 are comparable to that
for TMR-F. In fact, their execution time is nearly the same as the error-free execution time. The
execution times for the rollback schemes (DMR-B-1 and DMR-B-2) can be as high as 20 percent
more than the error-free execution time. The increase in $R_e$ with $p_f$ shows that rollback is still
possible in TMR-F, DMR-F-1 and DMR-F-2, even though these schemes can perform forward
recovery. For DMR-F-1, $R_e$ is larger than that for DMR-F-2 because there are more rollback
validation failures in DMR-F-1.

The average number of processors used for DMR-F-1 and DMR-F-2 is less than that of TMR
(Figure 3). Using more than three processors dynamically for the infrequent error situation enables
DMR-F-1 and DMR-F-2 to reduce the overall processor redundancy. As expected, the rollback
schemes, DMR-B-1 and DMR-B-2, use on average fewer processors than the others. For DMR-B-1,
$N_p$ decreases with $p_f$ because only one processor is used during recovery.

The number of checkpoints increases with $p_f$ for all schemes except TMR-F. For TMR-F,
$N_c$ is close to one. For DMR-F-1 and DMR-F-2, $N_c$ is slightly higher than that for DMR-B-1
and DMR-B-2. This seems to contradict the expectation that more checkpoints would be accumulated
during recovery for DMR-B-1 and DMR-B-2. However, DMR-B-1 and DMR-B-2 do have a smaller
$N_c$ than DMR-F-1 and DMR-F-2 because they have a longer execution time than DMR-F-1 and
DMR-F-2 due to rollbacks. The difference in $N_c$ may be insignificant, since most modern systems
usually have a large secondary storage for the checkpoint files.

As expected, the presence of overhead increases $R_e$. Both DMR-F-1 and DMR-F-2 still have
an execution time close to the error-free execution time (within 5% for DMR-F-2 and 10% for DMR-F-1).
For DMR-F-1 and DMR-F-2, $N_p$ is increased less than 1% because the extra processors are
used only during recovery. $N_p$ for TMR-F and DMR-B-2 are constant, since they always use three
and two processors, respectively, during both normal execution and recovery.

E.  Overhead  Impact  on  Performance 

The impact of the checkpoint overhead is determined by $c + \alpha$ and depicted in Figures 2, 3
and 4 as the zero-overhead curves. The impact of checkpoint overhead on $R_e$ for DMR-F-1, DMR-F-2
and TMR-F is smaller than that for DMR-B-1 and DMR-B-2. This is because the rollback
reduction in DMR-F-1, DMR-F-2 and TMR-F leads to fewer checkpointing sessions in computation
(Figure 2). For DMR-F-1 and DMR-F-2, $N_p$ is more sensitive to the checkpoint overhead than that
for TMR-F, DMR-B-1 and DMR-B-2 as indicated by a positive slope in Figure 3. The static
redundancy employed in TMR-F and DMR-B-2 is reflected by the flat slopes in Figure 3. Except
for TMR-F, the sensitivity of $N_c$ to the checkpoint overhead is reflected by the relatively steep
slopes in Figure 4.

Figures 5, 6 and 7 compare the overhead coefficients for restart time and checkpoint comparison
time. The solid curves represent the impact of $t_r$; the dotted ones depict the impact of $t_t$. The
impact of the comparison time $t_t$ is more than twice that of the restart time $t_r$. This suggests that
any decrease in comparison time will result in a bigger gain in performance improvement than will
an equal decrease in restart time. In Figure 5, $t_r$ and $t_t$ affect $R_e$ for DMR-F-2 and DMR-B-2 more
than $R_e$ for other schemes; TMR-F is insensitive to both $t_r$ and $t_t$. For DMR-F-1 and DMR-F-2, $t_r$
and $t_t$ affect $N_p$ more than for other schemes because both schemes employ extra processors during
recovery (Figure 6). As indicated by the large slopes of the dotted curves in Figure 7, the number
of checkpoints for all schemes except TMR-F is sensitive to the checkpoint comparison overhead.

Figure 6.  Overhead Impact on Number of Processors.

Figure 7.  Overhead Impact on Number of Checkpoints.

The impact of a centralized file server is depicted in Figure 8 for a case with 5% overheads.
The solid curves are for the centralized server case, while the dotted ones are for the noncentral server
case. The impact of a single file server for TMR-F, DMR-B-1, and DMR-B-2 is not as significant
as  that  for  DMR-F-1  and  DMR-F-2,  since  there  are  additional  checkpoint  operations  and  restarts 
by  the  lookahead  and  rollback  validation  processes  during  recovery  for  DMR-F-1  and  DMR-F-2. 


F.  Checkpoint  Placement 

The formulas for $T_e$ in Tables 2 and 3 can be used to minimize the impact of checkpoint
time on execution time by selecting the proper checkpoint interval or frequency. Figure 9 shows the
expected execution time under different failure rates and overhead costs for DMR-F-1. The optimal
checkpoint frequency can be obtained by either numerical or graphical means, given a failure rate,
task computation time, and overhead costs such as checkpoint time, restart time, and comparison
time.

Note  in  Figure  9  that  for  a  low  checkpointing  overhead,  the  execution  time  curve  near  the 
bottom  is  rather  flat.  This  suggests  that  an  exact  checkpoint  interval  is  not  necessary  since  a  few 
additional checkpoints still give a near optimal solution. For small failure rates, the checkpoint
interval  is  usually  large  or  checkpoint  frequency  is  small.  This  observation  agrees  with  the  previous 
studies  on  optimal  checkpoint  placement  for  other  recovery  schemes  [2-5]. 
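A numerical search for the checkpoint count can be sketched as follows. The sketch rests on several assumptions: failures follow a Poisson process with a hypothetical rate `rate`; the expected execution time is taken as $T_e = n(\Delta + t_k) R_e$, which follows from the definition of $R_e$; the DMR-F-1 expression for $R_e$ is a reading of Table 2 with inferred exponents; and all numeric inputs are illustrative rather than measured.

```python
import math


def dmr_f1_relative_time(p_f: float, r_restart: float, r_test: float) -> float:
    """R_e for DMR-F-1 without a centralized file server (as read from Table 2)."""
    p_l = 2.0 * p_f * (1.0 - p_f) ** 2
    p_r = 2.0 * p_f ** 2 * (1.0 - p_f) + p_f ** 2
    return (1.0 + 2.0 * p_r / (1.0 - p_r)
            + (p_l + 2.0 * p_r) / (1.0 - p_r) * r_restart
            + (2.5 * p_l + 3.0 * p_r) / (1.0 - p_r) * r_test)


def expected_time(n: int, work: float, rate: float,
                  t_k: float, t_r: float, t_t: float) -> float:
    """Expected execution time when `work` seconds are split into n sessions."""
    delta = work / n                       # checkpoint interval
    session = delta + t_k                  # one computation session
    p_f = 1.0 - math.exp(-rate * session)  # Poisson failure assumption
    r_e = dmr_f1_relative_time(p_f, t_r / session, t_t / session)
    return n * session * r_e


def best_checkpoint_count(work: float, rate: float, t_k: float,
                          t_r: float, t_t: float, n_max: int = 2000) -> int:
    """Exhaustively search 1..n_max for the count minimizing expected time."""
    return min(range(1, n_max + 1),
               key=lambda n: expected_time(n, work, rate, t_k, t_r, t_t))


# Illustrative inputs: a 10-hour task, failure rate 1e-4 per second,
# 5-second checkpoints, 1-second restart and comparison times.
params = dict(work=36000.0, rate=1e-4, t_k=5.0, t_r=1.0, t_t=1.0)
n_opt = best_checkpoint_count(**params)
print(n_opt, expected_time(n_opt, **params))
```

Evaluating `expected_time` for counts near `n_opt` shows how slowly the expected time changes around the minimum, which is the flatness noted for low checkpointing overhead.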

IV.  Experimental  Implementation  Evaluation

A.  Distributed  Implementation 

In this section, we discuss our DMR-F-1 implementation for a distributed system consisting
of a Sun 3/280 server and a pool of 12 Sun 3/50 diskless workstations. The server provides
transparent access to remote file systems through Sun NFS under SunOS 4.0. A voter task for the checkpoint
comparison and recovery initiation runs on this server. All checkpoints are kept by the server. The
Sun 3/50 workstations are used as the processing units. This setting makes it possible to evaluate
the impact of the centralized file server. Our implementations are entirely user level with no kernel
modifications required.

Figure 9.  Optimal Checkpoint Placement.

A  checkpoint  used  in  our  implementation  is  a  snapshot  of  a  process  run-time  image  at  the 
time  of  checkpointing.  There  has  been  considerable  research  concerning  checkpoint  construction 
in  UNIX  [14-17].  Smith  implemented  a  mechanism  for  checkpoint  construction  in  UNIX  for  the 
purpose  of  process  migration  [14].  His  checkpoint  is  an  executable  file  generated  by  a  checkpoint 
operation.  It  contains  the  text  segment,  the  data  segment,  as  well  as  the  stack  segment  of  the 
process  state.  The  stack  segment  is  treated  as  a  part  of  the  data  segment.  The  processor  state 
(e.g., registers) is saved by a setjmp() call. The restart of the checkpointed process is simply
the  reexecution  of  this  executable  file  on  another  processor.  Li  and  Fuchs  developed  a  checkpointing 
scheme  for  their  compiler-assisted  checkpoint  insertion  techniques  [16].  Their  checkpoint  is  a  data 
file  that  contains  the  data  segment  and  partial  stack  segment  of  the  checkpointed  process.  The 
checkpoint is intended for use in the same shell process on the same machine. Our implementation
uses a checkpoint structure similar to that of Li and Fuchs. In addition to having the complete stack
and data segments, our checkpoint also contains a segment for the file I/O output data during that
checkpoint interval. The omission of the text segment is possible because the original executable
file  is  already  available  through  NFS.  There  is  no  need  to  transfer  the  executable  file  to  perform  a 
remote  restart. 

Two  problems  have  to  be  overcome  for  any  recovery  scheme  that  uses  checkpoint  comparison: 
the  remote  restartability  and  comparability  of  a  checkpoint.  That  is,  a  task  must  be  able  to  be 
restarted  from  a  checkpoint  produced  on  other  nodes,  and  a  checkpoint  produced  on  a  node  must  be 
identical  to  any  checkpoint  from  any  other  nodes  if  both  are  correct  and  for  the  same  computation. 
The  former  is  required  for  process  replication  (lookahead  execution),  while  the  latter  is  needed  for 
checkpoint  validation. 

The  uniform  virtual  memory  layout  of  UNIX  in  homogeneous  machines  provides  the  basis  for 
the  restart  of  a  checkpointed  process  on  a  remote  node.  However,  some  user  process  information 
is  usually  kept  in  the  kernel  for  efficiency.  A  checkpoint  without  this  information  may  not  be 
restartable  even  for  the  same  kernel.  One  example  is  the  file  I/O  information  in  the  file  descriptor 
table  in  the  kernel.  When  a  process  terminates  or  aborts,  this  information  is  cleared  by  the  kernel. 
Restarting  a  process  from  a  checkpoint  without  reestablishing  this  information  in  another  kernel 
makes  a  local  file  descriptor  in  a  user  program  meaningless. 

To make a checkpoint remotely restartable, the user information kept in the kernel has to
be extracted during checkpointing and reestablished in the new kernel at restart [14,15]. A set of
library routines was developed for file I/O operations. The library keeps extra data as a part of
the checkpoint, such as the file name, access mode, and file position associated with each opened file.
During checkpointing, all file buffers are flushed for opened files, and the file positions are updated
and stored in the checkpoint. During a restart, those files are reopened and repositioned according
to the previously saved information in the checkpoint. In this manner, the attributes of file I/O
can be saved and restored easily across the network. However, the checkpoint may still not be
restartable even with the complete information of a user process state, since some state attributes
are kernel-dependent. They cannot be saved and carried across kernels (i.e., nodes) in a sensible
fashion [14,15]. Examples are the process group, signals received, the value of the real-time clock, and any
children that the process may have spawned with fork(). Similar to CONDOR and Smith, our
current implementation assumes that for restartability a program may not use or depend on those
kernel-dependent attributes that have partial information internal to the operating system other
than file I/O.
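
The save-and-reestablish idea for file I/O can be sketched as follows. This is only an illustration under stated assumptions: `FileRecord` and `FileTable` are hypothetical names, the sketch is written in Python rather than as the C library routines described above, and only binary-mode files are handled.

```python
import os
import tempfile
from dataclasses import dataclass
from typing import BinaryIO, Dict, List


@dataclass
class FileRecord:
    name: str      # path used to reopen the file at restart
    mode: str      # access mode, e.g. "rb" or "wb"
    position: int  # file offset at checkpoint time


class FileTable:
    """Hypothetical user-level mirror of the kernel's per-process file state."""

    def __init__(self) -> None:
        self._files: Dict[str, BinaryIO] = {}
        self._modes: Dict[str, str] = {}

    def open(self, name: str, mode: str) -> BinaryIO:
        handle = open(name, mode)
        self._files[name] = handle
        self._modes[name] = mode
        return handle

    def get(self, name: str) -> BinaryIO:
        return self._files[name]

    def checkpoint(self) -> List[FileRecord]:
        """Flush buffers and record name, mode and position of each open file."""
        records = []
        for name, handle in self._files.items():
            handle.flush()
            records.append(FileRecord(name, self._modes[name], handle.tell()))
        return records

    def close_all(self) -> None:
        for handle in self._files.values():
            handle.close()
        self._files.clear()
        self._modes.clear()

    @classmethod
    def restart(cls, records: List[FileRecord]) -> "FileTable":
        """Reopen and reposition every file listed in the checkpoint."""
        table = cls()
        for rec in records:
            # Reopening with "w" would truncate earlier output, so writers
            # are reopened for update instead.
            mode = "r+b" if "w" in rec.mode else rec.mode
            table.open(rec.name, mode).seek(rec.position)
        return table


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "input.dat")
    with open(path, "wb") as f:
        f.write(b"0123456789")

    table = FileTable()
    first = table.open(path, "rb").read(4)  # consume part of the input
    saved = table.checkpoint()              # goes into the checkpoint
    table.close_all()                       # process ends, kernel state is lost

    restarted = FileTable.restart(saved)    # possibly on another node
    rest = restarted.get(path).read()
    restarted.close_all()
```

Because the records hold only a path, a mode and an offset, they are meaningful on any node that sees the same file system, which is the role NFS plays in the implementation.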

The kernel-dependent attributes also cause checkpoints to be incomparable, even if these
checkpoints are all valid. For example, the value of the real-time clock for different kernels may
be different, since these clocks are seldom synchronized. The valid checkpoints from the same
execution on different nodes may not be the same if the program has these attributes as a part of
its memory space. For those kernel-dependent attributes, we enforce the following restrictions to
make the checkpoint comparable: we can eliminate the use of variables to store such kernel-specific
attributes, or carefully place them in local variables (on the stack) whose scope does not include
a checkpoint operation, or clear these variables before checkpointing. Fortunately, most numerical
applications seldom use kernel-dependent values except file I/O, and thus meet the restrictions we
put on checkpoint restartability and comparability.
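
The third restriction (clearing kernel-dependent variables before checkpointing) can be shown with a small sketch. The state dictionary, the `started_at` field and the JSON serialization are all assumptions made for illustration; the actual checkpoints are raw memory segments.

```python
import json
import time
from typing import Dict, Tuple


def take_checkpoint(state: Dict[str, float],
                    volatile_keys: Tuple[str, ...] = ("started_at",)) -> bytes:
    """Serialize the state after zeroing node-specific (volatile) fields."""
    cleaned = {k: (0 if k in volatile_keys else v) for k, v in state.items()}
    return json.dumps(cleaned, sort_keys=True).encode()


# Same computation on two nodes whose clocks are not synchronized.
now = time.time()
node_a = {"iteration": 7, "residual": 0.125, "started_at": now}
node_b = {"iteration": 7, "residual": 0.125, "started_at": now + 3.2}

# Without clearing, two valid checkpoints differ only because of the clock.
raw_a = json.dumps(node_a, sort_keys=True).encode()
raw_b = json.dumps(node_b, sort_keys=True).encode()
```

Here `raw_a` and `raw_b` differ although both replicas are correct, whereas `take_checkpoint(node_a)` and `take_checkpoint(node_b)` are identical.
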
B. Experiments

B.1. Benchmark Programs

Four scientific and two SPEC benchmark programs with different checkpoint sizes were selected
for our experiments [18]. Program convlv is an FFT algorithm that calculates the convolution
of 1024 signals with one response; ludcmp is the LU decomposition algorithm, applied
to 100 randomly generated matrices with sizes uniformly distributed from 50 to 60; rkf is the
Runge-Kutta-Fehlberg method for solving the ordinary differential equation y' = x + y, y(0) = 2
with  step  size  0.25  and  error  tolerance  5  x  10“^;  rsimp  is  the  revised  Simplex  method  for  solving 
the linear optimization problem for the BRANDY set from the Argonne National Laboratory. The
detailed description of these four benchmark programs can be found in [16]. The programs matrix300 and
nasa7 are two SPEC benchmarks: matrix300 performs various matrix multiplications, including
transposes, using the Linpack routines SGEMV, SGEMM and SAXPY on matrices of order 300,
whereas nasa7 is a modified version of the NASA Ames FORTRAN kernels, consisting of seven heavily
floating-point-intensive modules. The original version uses a large amount of memory and generates heavy
paging activity on the diskless workstations, which leads to a long execution time (44 hours). We
have changed some array dimensions so that paging would not delay our experiments (250 K data
and about 2 hours of execution).

The checkpoint operations were inserted into these benchmark programs manually. Table 4
summarizes the characteristics of each program with respect to checkpoint size, checkpoint time,
checkpoint interval and execution time. Checkpoint size is divided into the data segment, the stack segment
and the file output during the checkpoint interval. Programs rsimp and matrix300 give examples
of a large checkpoint. Most applications we examined have checkpoints of size 64-350 K. The
stack size is small in all six programs. This is expected for scientific applications, in which the calling
depth is rather limited. The file output size can be large in some applications (e.g., convlv).

Table 4. Overhead Measurements.

Program    # ckp      ckp_size (data/stack/file)   ckp_time (std. dev.)  cmp_time (std. dev.)  ckp_interval (std. dev.)  exec_time (w/o ckp)
           (per run)  (in bytes)                   (in sec)              (in sec)              (in sec)                  (in sec)
convlv     128        75950 (66196/1554/8200)      0.2172 (0.3411)       0.1608 (8.6302e-3)    13.917 (0.90787)          1809.22 (1781.42)
ludcmp     50         121510 (71708/1550/48252)    0.2408 (3.428e-2)     0.2030 (1.8224e-2)    20.626 (2.1092)           1043.38 (1031.34)
matrix300  150        2219446 (2217652/1794/0)     5.8714 (0.6949)       8.6157 (0.2338)       239.777 (26.729)          37092.88 (36206.30)
nasa7      49         351614 (349788/1826/0)       0.7672 (0.1347)       0.9660 (5.683e-2)     131.46 (28.22)            6611.44 (6573.00)
rkf        88         51777 (46972/1734/3071)      0.1477 (2.563e-2)     0.1492 (7.2498e-3)    29.7202 (1.0840)          2638.58 (2625.58)
rsimp      59         995314 (991676/3638/0)       2.411 (0.3767)        3.8286 (0.21893)      42.8063 (8.6359)          2713.04 (2568.38)

In Table 4, neither ckp_time nor cmp_time includes the processing time for the file output
portion of the checkpoint. For ckp_time, the file output portion is already written to disk during
execution; thus, it is not necessary to rewrite this portion to the checkpoint. Three variables in
a checkpoint are enough to locate this file output portion (file name, starting position and length
for each output file). We have found that checkpoint time, comparison time and restart time are
highly correlated. Since file I/O operations are the major part of checkpointing (write), checkpoint
comparison and restart (read), the overhead costs such as checkpoint time, comparison time and
restart time can be expected to be proportional to the size of the checkpoint files.
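
As a rough check of the proportionality claim, the Table 4 measurements can be turned into an implied write rate per program. The sketch below is an illustration only: the function names are hypothetical, and it assumes that the bytes written at a checkpoint are the data and stack segments, since ckp_time excludes the file output portion.

```python
# program: (data bytes, stack bytes, ckp_time in sec, ckp_interval in sec),
# transcribed from Table 4.
TABLE4 = {
    "convlv": (66196, 1554, 0.2172, 13.917),
    "ludcmp": (71708, 1550, 0.2408, 20.626),
    "matrix300": (2217652, 1794, 5.8714, 239.777),
    "nasa7": (349788, 1826, 0.7672, 131.46),
    "rkf": (46972, 1734, 0.1477, 29.7202),
    "rsimp": (991676, 3638, 2.411, 42.8063),
}


def write_rate(name: str) -> float:
    """Bytes of data and stack written per second of checkpoint time."""
    data, stack, ckp_time, _ = TABLE4[name]
    return (data + stack) / ckp_time


def checkpoint_overhead(name: str) -> float:
    """Checkpoint time as a fraction of the checkpoint interval."""
    _, _, ckp_time, interval = TABLE4[name]
    return ckp_time / interval


rates = {name: write_rate(name) for name in TABLE4}
overheads = {name: checkpoint_overhead(name) for name in TABLE4}
```

With these values the implied rates fall between roughly 300 and 460 KB/s, although the checkpoint sizes span a factor of more than 40, and the checkpoint time stays below 6% of the checkpoint interval for every program.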


B.2.  Error  Detection  by  Checkpoint  Comparison 


The effectiveness of checkpoint comparison is studied for the six selected programs. To avoid
the interference of run-time error injection with checkpoint comparability, a random bit or word
error is injected in the previous checkpoint to model a transient error occurrence during its
subsequent checkpoint interval. One task is started from this erroneous checkpoint and another task
from the error-free checkpoint. The checkpoints produced by the two tasks after one checkpoint
interval are compared. A mismatch indicates a detected error. Table 5 summarizes the results for
101 injected random errors. The number of errors detected is categorized by where the error is
detected: the data, stack and file output segments of the checkpoints. The abortion of a task
due to an error in the checkpoint can be treated as a special case of error detection by sending an
abort signal to the voter explicitly.

Table 5. Detectability Experiments.

            bit-wise errors                       word-wise errors
            # Errors Detected                     # Errors Detected
Program     data  stack  file  abort  # Missed    data  stack  file  abort  # Missed
convlv        68      3    30      0         0      71      0    30      0         0
ludcmp        43      0    58      0         0      37      3    59      2         0
matrix300    101      0     -      0         0     100      0     -      1         0
nasa7         87      0     -      0        14      87      0     -      0        14
rkf           78      1    22      0         0      76      3    22      0         0
rsimp         99      0     -      2         0      98      0     -      2         1
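
The injection procedure can be mimicked on a toy task. In the sketch below, `inject_bit_error` and `run_interval` are hypothetical names, and the prefix-sum "computation" is only a deterministic stand-in for one checkpoint interval of a real benchmark.

```python
import random


def inject_bit_error(checkpoint: bytes, rng: random.Random) -> bytes:
    """Flip one randomly chosen bit of a checkpoint image."""
    corrupted = bytearray(checkpoint)
    bit = rng.randrange(len(corrupted) * 8)
    corrupted[bit // 8] ^= 1 << (bit % 8)
    return bytes(corrupted)


def run_interval(checkpoint: bytes) -> bytes:
    """Toy deterministic task: the next state is the running sum modulo 256."""
    out = bytearray()
    total = 0
    for value in checkpoint:
        total = (total + value) & 0xFF
        out.append(total)
    return bytes(out)


rng = random.Random(1)
clean = bytes(range(64))               # error-free previous checkpoint
faulty = inject_bit_error(clean, rng)  # models a transient error

# Start one task from each checkpoint and compare the next checkpoints.
detected = run_interval(clean) != run_interval(faulty)
```

In this toy task every byte of the state is live, so a single flipped bit always propagates to the next checkpoint. A real program may hold dead variables, which is one way an injected error can go undetected without affecting correctness.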

The errors detected by checkpoint comparison account for the majority of the injected errors
(about 98%) for all programs except nasa7. If the file output during the checkpoint
interval were not included in the checkpoint structure, 22 to 59% of the errors would not be detected
(rkf, convlv and ludcmp). Some errors were missed in our experiments. In these cases, we have a
valid file output during execution and a valid checkpoint at the end; the missed errors are actually
masked and cause no problems with respect to correct execution. This occurs when an
error is in a dead variable and this variable is reinitialized later. A close look at the checkpoint
placement for nasa7 reveals that a new array of about 11% of the total checkpoint size is computed
during the checkpoint interval. The 14 missed errors were probably inserted into the new array
space and were overwritten during the computation. In sum, the checkpoint structure provided an
effective error detection tool for the programs we studied.
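
The percentages quoted above can be recomputed from the bit-wise half of Table 5. The sketch below is illustrative: the function names are hypothetical, and a "-" entry in the file column is treated as zero, which is an assumption about programs that write no file output.

```python
INJECTED = 101

# program: (data, stack, file, abort, missed) for bit-wise injection,
# transcribed from Table 5 with "-" taken as 0.
BITWISE = {
    "convlv": (68, 3, 30, 0, 0),
    "ludcmp": (43, 0, 58, 0, 0),
    "matrix300": (101, 0, 0, 0, 0),
    "nasa7": (87, 0, 0, 0, 14),
    "rkf": (78, 1, 22, 0, 0),
    "rsimp": (99, 0, 0, 2, 0),
}


def comparison_coverage(name: str) -> float:
    """Fraction of injected errors caught by comparing checkpoints."""
    data, stack, file_out, _abort, _missed = BITWISE[name]
    return (data + stack + file_out) / INJECTED


def file_only_share(name: str) -> float:
    """Fraction of injected errors visible only in the file output segment."""
    return BITWISE[name][2] / INJECTED


coverage = {name: comparison_coverage(name) for name in BITWISE}
file_share = {name: file_only_share(name) for name in BITWISE}
```

Each row sums to the 101 injected errors. The file-only share is about 22% for rkf and about 57% for ludcmp, which is the portion that would be lost if file output were left out of the checkpoint.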

B.3.  Performance  Results 

Each program was run five times for each p_f to obtain the average measures. The execution
time in our experiments is actually the program response time; it includes the system, user and
blocking times. The analytical predictions for the relative execution time, number of processors,
and number of checkpoints are also included in Figures 10, 11 and 12 to compare against our
experimental results. The data were collected at night to minimize the impact of workload.

In Figure 10, the relative execution time for the programs with a moderate checkpoint size
(ludcmp, convlv, nasa7 and rkf) is close to the analytical zero-overhead prediction (solid curve),
since the overheads for those programs are very small compared to their checkpoint intervals. The
relative execution time for the programs with large checkpoints (matrix300 and rsimp) fits well
with the analytical prediction under a centralized file server (the dotted curve, assuming an overhead
level of rsimp). This increase in execution time for large checkpoints can be explained by the fact
that matrix300 and rsimp are likely to be blocked due to the large file I/O operations during
checkpointing and comparison. In fact, the limited speed of the NFS file handling and our use
of the file server for managing checkpoints centrally resulted in a performance bottleneck. The
paging activities from the replicated processes also contribute to the increase in execution time.
The relative execution time increases significantly for high error rates due to the heavy file server
activities during checkpointing and comparison of checkpoints. This suggests that a reduction in
checkpoint size, an increase in file system speed, or other noncentralized server implementations
may improve the execution time over that of our current implementation. The R_e fluctuations in
Figure 10 are caused by the uneven network workload distribution at the time of data collection
and the small sample size (5) at the low failure rates.

For the p_f we considered, the number of processors used, N_p, is less than the three that TMR
requires, although DMR-F-1 uses two more processors momentarily during lookahead/validation
operations; N_p is quite insensitive to checkpoint size (Figure 11). The number of checkpoints, N_c,
is highly sensitive to the workload and checkpoint size, as a result of the checkpoint accumulation
in the file system due to the uneven processor speed (Figure 12), especially for the programs with
large checkpoint sizes.


V.  Conclusions 

In this paper, we have described a checkpoint-based recovery strategy using optimistic
execution and rollback validation for parallel and distributed systems. This approach can reduce
rollbacks without depending on specific error-correction knowledge or the standard TMR
redundancy. Our recovery schemes (DMR-F-1 and DMR-F-2) can achieve a nearly error-free execution
time with an average redundancy less than that for TMR. In addition, our analysis and experiments
have shown that checkpoint comparison time has more impact on performance degradation than
restart time. The impact of the centralized file server is also significant for the competing processes
during checkpointing, especially for the ones with large checkpoints. Checkpoint comparison was
an effective means of error detection and checkpoint validation.

APPENDIX: Analytical Derivations

Due to space limitations, we present the analysis only for DMR-F-1 in this appendix. The
analysis for DMR-F-2, DMR-B-1, DMR-B-2 and TMR-F is similar [13]. If an error occurs, DMR-F-1
behaves differently in the last checkpoint interval. Since no lookahead execution is possible
for this interval, a rollback is always required during recovery. An n-session computation consists
of n-1 lookaheadable sessions followed by a rollbackable session. Duda has an excellent analysis
for the last rollbackable session [5]. Besides, the performance degradation contribution of the last
rollbackable session is proportional to 1/n, while that of the n-1 lookaheadable sessions is proportional to (n-1)/n.
The approximation of an n-session computation by n lookaheadable sessions is adequate even
for a moderate n. Therefore, our analysis focuses on the situation of n lookaheadable sessions.

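Let $T_n$ be the expected execution time for an n-session (lookaheadable) computation, and let
$p_l$ and $p_r$ be the probabilities of a successful lookahead and rollback in DMR-F-1, respectively.
The expected execution time is then

\[
T_n =
\begin{cases}
\Delta + t_k + T_{n-1}, & \text{with probability } 1 - p_l - p_r \\
\Delta + t_k + t_r + 2.5\,t_t + T_{n-1}, & \text{with probability } p_l \\
2\Delta + 2t_k + 2t_r + 3t_t + T_n, & \text{with probability } p_r
\end{cases}
\]

or,

\[
T_n = (\Delta + t_k) + \frac{p_l}{1 - p_r}\,(t_r + 2.5\,t_t) + \frac{p_r}{1 - p_r}\,(2\Delta + 2t_k + 2t_r + 3t_t) + T_{n-1}.
\]

Solving this equation with the initial condition $T_0 = 0$ (no sessions), we have

\[
T_e = T_n = n(\Delta + t_k) + \frac{n\,p_l}{1 - p_r}\,(t_r + 2.5\,t_t) + \frac{n\,p_r}{1 - p_r}\,(2\Delta + 2t_k + 2t_r + 3t_t),
\]

\[
R_e = \frac{T_e}{T_o} = 1 + \frac{2p_r}{1 - p_r} + \frac{p_l + 2p_r}{1 - p_r}\,\frac{t_r}{\Delta + t_k} + \frac{2.5\,p_l + 3p_r}{1 - p_r}\,\frac{t_t}{\Delta + t_k},
\]

where $T_o = n(\Delta + t_k)$ is the execution time without errors.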
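
The algebra from the three-case expression to the closed form can be checked numerically. The sketch below is illustrative only: the function names and the parameter values are assumptions, and the symbols follow the reading of the derivation in which `delta` is the checkpoint interval, `t_k` the checkpoint time, `t_r` the restart time and `t_t` the comparison time.

```python
import math


def t_n_from_cases(n, delta, t_k, t_r, t_t, p_l, p_r):
    """Iterate the three-case expectation, solved for T_n at each session."""
    t_prev = 0.0  # initial condition: zero sessions take zero time
    for _ in range(n):
        normal = delta + t_k + t_prev
        lookahead = delta + t_k + t_r + 2.5 * t_t + t_prev
        rollback = 2 * delta + 2 * t_k + 2 * t_r + 3 * t_t  # then T_n again
        # T_n = (1-p_l-p_r)*normal + p_l*lookahead + p_r*(rollback + T_n)
        t_prev = ((1 - p_l - p_r) * normal + p_l * lookahead
                  + p_r * rollback) / (1 - p_r)
    return t_prev


def t_e_closed_form(n, delta, t_k, t_r, t_t, p_l, p_r):
    """Closed-form expected execution time for n lookaheadable sessions."""
    return (n * (delta + t_k)
            + n * p_l / (1 - p_r) * (t_r + 2.5 * t_t)
            + n * p_r / (1 - p_r) * (2 * delta + 2 * t_k + 2 * t_r + 3 * t_t))


def relative_time(delta, t_k, t_r, t_t, p_l, p_r):
    """Relative execution time R_e = T_e / (n * (delta + t_k))."""
    base = delta + t_k
    return (1 + 2 * p_r / (1 - p_r)
            + (p_l + 2 * p_r) / (1 - p_r) * t_r / base
            + (2.5 * p_l + 3 * p_r) / (1 - p_r) * t_t / base)


# Hypothetical parameter values, loosely in the range of Table 4.
params = dict(delta=20.0, t_k=0.24, t_r=0.2, t_t=0.2, p_l=0.03, p_r=0.01)
n = 50
iterated = t_n_from_cases(n, **params)
closed = t_e_closed_form(n, **params)
ratio = closed / (n * (params["delta"] + params["t_k"]))
```

The iterated and closed-form values agree to rounding error, and dividing by n(Δ + t_k) reproduces the expression for the relative execution time.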

The number of processors used is two for the normal execution. It is five during recovery for
transient faults and four for permanent faults, since a faulty processor is used in the latter case.
In this paper, we analyze only the situation of transient faults, since this gives the worst case.
Let $l$ and $r$ denote $\frac{n\,p_l}{1 - p_r}$ and $\frac{n\,p_r}{1 - p_r}$, respectively. Therefore,

\[
\begin{aligned}
\int_0^{T_e} N_p(t)\,dt ={}& 2(n - l)(\Delta + t_k) + 2l\,t_t + 5l\,(\Delta + t_k + t_r + 1.5\,t_t) \\
& + 2r\,(\Delta + t_k + t_r + t_t) + 5r\,(\Delta + t_k + t_r + 2t_t) \\
={}& 2T_e + 3T_o\,\frac{p_l + p_r}{1 - p_r} + 3n\,\frac{p_l + p_r}{1 - p_r}\,t_r + 3n\,\frac{1.5\,p_l + 2p_r}{1 - p_r}\,t_t,
\end{aligned}
\]

\[
N_p = 2 + 3\,\frac{p_l + p_r}{(1 - p_r)R_e} + 3\,\frac{p_l + p_r}{(1 - p_r)R_e}\,\frac{t_r}{\Delta + t_k} + 3\,\frac{1.5\,p_l + 2p_r}{(1 - p_r)R_e}\,\frac{t_t}{\Delta + t_k}.
\]
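
The average number of processors can be cross-checked by evaluating the time-weighted integral directly and dividing by the expected execution time. The sketch below is an illustration under assumptions: the function names and parameter values are hypothetical, and the formulas are taken in the form read from this appendix.

```python
import math


def relative_time(delta, t_k, t_r, t_t, p_l, p_r):
    """Relative execution time R_e for DMR-F-1, as read from the appendix."""
    base = delta + t_k
    return (1 + 2 * p_r / (1 - p_r)
            + (p_l + 2 * p_r) / (1 - p_r) * t_r / base
            + (2.5 * p_l + 3 * p_r) / (1 - p_r) * t_t / base)


def avg_processors(delta, t_k, t_r, t_t, p_l, p_r):
    """Closed-form average number of processors N_p."""
    base = delta + t_k
    q = (1 - p_r) * relative_time(delta, t_k, t_r, t_t, p_l, p_r)
    return (2 + 3 * (p_l + p_r) / q
            + 3 * (p_l + p_r) / q * t_r / base
            + 3 * (1.5 * p_l + 2 * p_r) / q * t_t / base)


def avg_processors_from_integral(n, delta, t_k, t_r, t_t, p_l, p_r):
    """Processor-time integral divided by the expected execution time."""
    l = n * p_l / (1 - p_r)  # expected number of lookahead recoveries
    r = n * p_r / (1 - p_r)  # expected number of rollbacks
    base = delta + t_k
    integral = (2 * (n - l) * base + 2 * l * t_t
                + 5 * l * (base + t_r + 1.5 * t_t)
                + 2 * r * (base + t_r + t_t)
                + 5 * r * (base + t_r + 2 * t_t))
    t_e = (n * base + l * (t_r + 2.5 * t_t)
           + r * (2 * base + 2 * t_r + 3 * t_t))
    return integral / t_e


# Hypothetical parameter values, loosely in the range of Table 4.
params = dict(delta=20.0, t_k=0.24, t_r=0.2, t_t=0.2, p_l=0.03, p_r=0.01)
n_p = avg_processors(**params)
n_p_check = avg_processors_from_integral(50, **params)
```

For small error probabilities the result stays between two and three processors, consistent with the observation that the average redundancy is below that of TMR.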


There is one (the committed) checkpoint during the normal execution run. Two additional
(uncommitted) checkpoints are present during a lookahead/validation operation. At the end of the
Δ for the rollback validation, there are eight checkpoints, one committed and seven uncommitted
(one for the validation process, two for the normal process pair and four for the lookahead processes).


Thus,

\[
N_c = \cdots + \frac{\cdots}{(1 - p_r)R_e} + \frac{\cdots}{(1 - p_r)R_e}\,\frac{\cdots}{\Delta + t_k} + \frac{\cdots}{(1 - p_r)R_e}\,\frac{\cdots}{\Delta + t_k}.
\]

If a centralized file server serializes the file accesses, both the restart time ($t_r$) and the checkpoint time
($t_k$) will be increased because of the serialized file accesses generated by checkpointing (write) and
restart (read). According to our assumption in Section III, the restart time will be threefold, since
three processes read the last committed checkpoint file at the same time (the rollback validation and
the two lookahead processes). The checkpoint time is $2t_k$ for the normal pair of task replications
and $5t_k$ for the lookahead period (four for the lookahead and one for the rollback validation). Thus,
the relative execution time with a centralized file server can be shown as


\[
R_e = 1 + \frac{2p_r}{1 - p_r} + \frac{p_l + \frac{5}{3}p_r}{1 - p_r}\,\frac{3t_r}{\Delta + t_k} + \frac{2.5\,p_l + 3p_r}{1 - p_r}\,\frac{t_t}{\Delta + t_k}.
\]


Following a similar analysis, we can obtain the formulas in Tables 2 and 3.